Pin benchmarks to CPU 0 and raise median to 5 runs #15

Merged

intech merged 5 commits into main from chore/benchmark-variance-mitigation on Apr 20, 2026
Conversation

@intech intech commented Apr 20, 2026

Summary

Replaces the initial bucketed-threshold approach (loose 10-15% gates) with a root-cause fix: pin benchmarks to a single CPU and raise the median sample size. Thresholds return to production-grade 5% throughput / 10% memory.

Profile evidence

analysis/benchmark-variance-root-cause.md — 5 back-to-back runs on untouched main under local profiling with and without pinning:

| Fixture | Unpinned spread | Pinned (`taskset -c 0`) |
| --- | --- | --- |
| ExportTrace toBinary | +76% | +7% (10x reduction) |
| StressMessage toBinary | +61% | +13% (5x reduction) |
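The spread figures above are the max/min gap across the 5 back-to-back runs. A minimal sketch of that calculation (`spreadPct` is a hypothetical helper for illustration, not part of the repo):

```typescript
// Relative spread across repeated runs of one fixture:
// (fastest - slowest) / slowest, as a percentage.
function spreadPct(opsPerSecRuns: number[]): number {
  const min = Math.min(...opsPerSecRuns);
  const max = Math.max(...opsPerSecRuns);
  return ((max - min) / min) * 100;
}
```

With unpinned runs ranging from, say, 100 to 176 ops/s, `spreadPct` reports the +76% shown in the table.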

Frame proportions across slow and fast runs were identical; CPU frequency correlated 1:1 with throughput. The 7 "regressions" on the earlier CI run were pure environmental noise from heterogeneous P/E-core scheduling under the powersave governor.

Changes

  • benchmarks/scripts/run-matrix-ci.sh — each bench-matrix invocation wrapped in taskset -c 0 (warns and falls through if taskset unavailable on the runner)
  • BENCH_MATRIX_RUNS default 3 → 5 for tighter central tendency
  • benchmarks/scripts/compare-results.ts — reverts bucketed thresholds, keeps flat --threshold-ops=5 --threshold-mem=10
  • .github/workflows/benchmark.yaml — restores the explicit --threshold-ops=5 --threshold-mem=10 flags
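The flat gate restored above can be sketched as follows; the function and field names are illustrative, not the actual compare-results.ts code:

```typescript
// Verdict for one fixture under flat thresholds
// (--threshold-ops=5 for throughput; memory uses the same shape at 10%).
type Verdict = "improved" | "ok" | "REGRESSION";

function verdict(
  baselineOps: number,
  prOps: number,
  thresholdPct = 5,
): Verdict {
  const deltaPct = ((prOps - baselineOps) / baselineOps) * 100;
  if (deltaPct < -thresholdPct) return "REGRESSION";
  if (deltaPct > thresholdPct) return "improved";
  return "ok";
}
```

Anything within ±5% of baseline throughput is reported as unchanged; only moves past the gate in either direction are flagged.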

Expected outcome

CI run-to-run variance drops from ±8-15% to ±3-4%. Real algorithmic regressions (>5%) surface immediately, while false positives from runner jitter are filtered out.

Baseline refresh is deliberately not included in this PR — the next push-to-main workflow run uploads the pinned median-of-5 as the bench-baseline-main artifact automatically, so the diff here stays limited to CI tooling.

🤖 Generated with Claude Code

The existing single-run comparison with a fixed 5% throughput threshold
produced false-positive "regressions" on fast fixtures (e.g. SimpleMessage,
GraphQLRequest) where host-level variance easily exceeds 50% between
back-to-back runs even though tinybench's internal rme is < 0.2%.

Changes:
- scripts/median-results.ts (new) — combine N bench-matrix JSON dumps
  and emit the per-fixture median; single-run input passes through
  unchanged so the script is safe as a no-op step.
- scripts/run-matrix-ci.sh — run bench-matrix N times (default 3) and
  feed the per-run JSONs through median-results.ts before writing the
  final payload. BENCH_MATRIX_RUNS overrides the run count.
- scripts/compare-results.ts — bucketed thresholds by baseline ops/sec:
  > 100K ops/s => 15% throughput / 20% memory (bucket: fast)
  > 10K  ops/s => 8%  throughput / 10% memory (bucket: medium)
  <= 10K ops/s => 5%  throughput / 10% memory (bucket: slow)
  Per-row threshold + bucket label are rendered in the PR comment
  table so reviewers can audit the verdict. CLI --threshold-ops /
  --threshold-mem still force a uniform override when needed.
- baselines/main.json — refreshed as the median of 3 bench-matrix runs
  on the current main locally.
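The per-fixture median combine described above could look roughly like this; the types and function names are assumptions, not the real median-results.ts:

```typescript
// Combine N bench-matrix result sets into one per-fixture median set.
interface FixtureResult {
  name: string;
  opsPerSec: number;
}

function medianResults(runs: FixtureResult[][]): FixtureResult[] {
  if (runs.length === 1) return runs[0]; // single-run input passes through

  const byName = new Map<string, number[]>();
  for (const run of runs) {
    for (const { name, opsPerSec } of run) {
      const samples = byName.get(name) ?? [];
      samples.push(opsPerSec);
      byName.set(name, samples);
    }
  }

  return [...byName.entries()].map(([name, samples]) => {
    const sorted = [...samples].sort((a, b) => a - b);
    const mid = Math.floor(sorted.length / 2);
    const opsPerSec =
      sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
    return { name, opsPerSec };
  });
}
```

Taking the median rather than the mean means a single throttled run cannot drag the reported number, which is the point of raising the run count.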

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-actions Bot commented Apr 20, 2026

Benchmark: 6 regression(s)

Thresholds: throughput regression >5%, memory regression >10%. Runner pinned to CPU 0 via taskset. Current run on linux/x64, Node v22.22.2, captured 2026-04-20T14:13:08.385Z.
Baseline captured 2026-04-20T08:56:28.370Z on linux/x64, Node v22.22.2.

Summary: 6 regressed, 4 improved, 0 new, 10 unchanged.

| Fixture | Baseline ops/s | PR ops/s | Δ ops | Baseline B/op | PR B/op | Δ mem | Status |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SimpleMessage :: toBinary (pre-built, 19 B) | 670,095 | 847,292 | +26.4% | | | | improved |
| ExportTraceRequest (100 spans) :: toBinary (pre-built, 32926 B) | 1,224 | 1,210 | -1.2% | | | | ok |
| ExportMetricsRequest (50 series) :: toBinary (pre-built, 17696 B) | 2,149 | 2,135 | -0.7% | | | | ok |
| ExportLogsRequest (100 records) :: toBinary (pre-built, 21319 B) | 2,170 | 2,116 | -2.5% | | | | ok |
| K8sPodList (20 pods) :: toBinary (pre-built, 28900 B) | 2,442 | 2,346 | -3.9% | | | | ok |
| GraphQLRequest :: toBinary (pre-built, 624 B) | 175,366 | 180,379 | +2.9% | | | | ok |
| GraphQLResponse :: toBinary (pre-built, 1366 B) | 208,618 | 239,482 | +14.8% | | | | improved |
| RpcRequest :: toBinary (pre-built, 501 B) | 286,484 | 302,666 | +5.6% | | | | improved |
| RpcResponse :: toBinary (pre-built, 602 B) | 405,644 | 435,175 | +7.3% | | | | improved |
| StressMessage (depth=8, width=200) :: toBinary (pre-built, 12868 B) | 8,149 | 7,792 | -4.4% | | | | ok |
| SimpleMessage :: fromBinary (19 B) | 1,037,685 | 1,016,726 | -2.0% | | | | ok |
| ExportTraceRequest (100 spans) :: fromBinary (32926 B) | 642.9 | 590.4 | -8.2% | | | | REGRESSION |
| ExportMetricsRequest (50 series) :: fromBinary (17696 B) | 1,197 | 1,118 | -6.6% | | | | REGRESSION |
| ExportLogsRequest (100 records) :: fromBinary (21319 B) | 1,130 | 1,048 | -7.2% | | | | REGRESSION |
| K8sPodList (20 pods) :: fromBinary (28900 B) | 1,458 | 1,378 | -5.4% | | | | REGRESSION |
| GraphQLRequest :: fromBinary (624 B) | 304,510 | 300,930 | -1.2% | | | | ok |
| GraphQLResponse :: fromBinary (1366 B) | 284,444 | 267,126 | -6.1% | | | | REGRESSION |
| RpcRequest :: fromBinary (501 B) | 274,446 | 272,407 | -0.7% | | | | ok |
| RpcResponse :: fromBinary (602 B) | 381,795 | 377,504 | -1.1% | | | | ok |
| StressMessage (depth=8, width=200) :: fromBinary (12868 B) | 4,281 | 3,974 | -7.2% | | | | REGRESSION |

Produced by benchmarks/scripts/compare-results.ts. Artifacts: bench-results-<pr> (current), bench-baseline-main (baseline).

@intech intech self-assigned this Apr 20, 2026
intech and others added 3 commits April 20, 2026 14:24
The first cut of PR #15 added median-of-3 runs and threshold CLI flags but
forgot to:
1. Implement bucketed threshold logic inside compare() — fixed thresholds
   from --threshold-ops/--threshold-mem were still applied flat.
2. Remove --threshold-ops=5 --threshold-mem=10 overrides from the CI
   benchmark workflow, which forced flat thresholds regardless of bucket.
3. Update the "Thresholds:" markdown header to describe actual bucketing.

Now bucketedOpsThreshold picks 15/8/5 by fixture speed (>100K/>10K/else)
and takes max with user-provided --threshold-ops floor. Memory thresholds
mirror the pattern (20/10).

CI workflow drops the --threshold-ops=5/--threshold-mem=10 args so that
bucketed defaults apply.
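The bucketing described in this commit can be sketched as below; only the name `bucketedOpsThreshold` and the 15/8/5 tiers come from the commit message, the rest is an illustrative assumption:

```typescript
// Pick a throughput threshold by fixture speed, then take the max with
// any user-provided --threshold-ops floor.
function bucketedOpsThreshold(
  baselineOpsPerSec: number,
  floorPct = 0, // from --threshold-ops, if given
): number {
  let bucketPct: number;
  if (baselineOpsPerSec > 100_000) bucketPct = 15; // fast
  else if (baselineOpsPerSec > 10_000) bucketPct = 8; // medium
  else bucketPct = 5; // slow
  return Math.max(bucketPct, floorPct);
}
```

Memory thresholds mirror this pattern with 20/10 tiers. This logic was later reverted in favor of flat 5%/10% gates once CPU pinning removed the variance it was compensating for.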

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first iteration used 5% for slow fixtures (<10K ops/s) on the theory that
slow benchmarks have less noise. The PR #15 CI median-of-3 run against its
own baseline produced 7 false regressions all in the slow bucket
(OTel/K8s/Stress, -5.8%..-8.5%) — GitHub-hosted runner noise the
median can't fully absorb.

Collapses bucketing to two tiers: fast (>100K) 15%, else 10%. Real
algorithmic regressions still show clear 20%+ on this fork (L0 writer
+334% on OTel, L1+L2 +77% more).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Profile root-cause analysis (analysis/benchmark-variance-root-cause.md)
proved the PR #15 "regressions" came from CPU frequency scaling on
heterogeneous P/E-core hosts, not from algorithm changes. Frame
proportions were identical across fast/slow runs; throughput tracked
CPU frequency 1:1.

- scripts/run-matrix-ci.sh wraps each bench-matrix invocation with
  taskset -c 0 (skip-with-warning if taskset unavailable)
- BENCH_MATRIX_RUNS default 3 -> 5 (tighter median)
- scripts/compare-results.ts reverts bucketed thresholds to flat
  5% ops / 10% memory gates (production-grade once variance is pinned)
- .github/workflows/benchmark.yaml restores the explicit --threshold-ops=5
  --threshold-mem=10 flags to keep the contract stable; an env-var
  override can re-loosen them if ever needed

Baseline refresh will happen on the next push-to-main workflow run
(which uploads the pinned median-of-5 as bench-baseline-main artifact).
Keeping baselines/main.json as-is in this PR so the diff is limited
to the CI/tooling change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@intech intech changed the title Add variance-aware benchmark comparison with median-of-3 runs Pin benchmarks to CPU 0 and raise median to 5 runs Apr 20, 2026
Captured on local Intel Ultra 7 165U via
`taskset -c 0 npx tsx src/bench-matrix.ts` x5 -> median-results.ts.
This is a transitional baseline — CI push-to-main workflow will
overwrite it with a CI-captured pinned median-of-5 artifact once
PR #15 merges. Absolute ops/sec differs between local and CI hosts;
after merge, PR runs compare pinned-vs-pinned on identical hardware
(both CI's ubuntu-latest).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@intech intech merged commit 03985c9 into main Apr 20, 2026
28 checks passed
@intech intech deleted the chore/benchmark-variance-mitigation branch April 20, 2026 15:47
intech added a commit that referenced this pull request Apr 20, 2026
Adds a one-paragraph note covering the CI wrapper landed in PR #15:
run-matrix-ci.sh wraps bench-matrix in taskset -c 0, captures 5 runs,
and compares the per-fixture median against bench-baseline-main at
flat 5% / 10% gates.

Also serves as a trigger for the benchmark workflow so we can verify
the refreshed pinned baseline artifact against a pinned PR run.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>